In part 2, we will continue analyzing the potential relationships between variables that we have interest in. We will conduct 12 statistical analyses in total: 2 Chi-square tests to explore association between two categorical variables(Frisk VS Search, Officer in uniform VS Suspect arrested); 4 linear regressions to find out whether linear relation exists in two numeric variables(Observe time VS Stop time, Stop time VS Suspect reported age, Observe time VS Suspect height, Stop time VS Suspect weight); 6 Comparisons of Means T-test to see whether there is statistical difference between two specific groups in means of the variable studied(means of observe mins/stop mins of 2 different eye colors, means of observe mins/stop mins of thin suspects and heavy suspects, means of observe mins/stop mins of female suspects and male suspects).
We also develop our own methods to define and exclude outliers(since there are extreme large values in our studied variables), and use that to conduct the analysis again after using complete data to do first. Sometimes there is no statistical result found in the complete data, but we got some interesting results after excluding the extreme outliers.
# Set up: load the NYPD dataset and the packages that we will use
NYPD <- read.csv("Data/2018_sqf_database-abbr.csv")
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.6 ✓ dplyr 1.0.9
## ✓ tidyr 1.2.0 ✓ stringr 1.4.0
## ✓ readr 2.1.2 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
head(NYPD)
## STOP_FRISK_ID Stop.Frisk.Time YEAR2 MONTH2 DAY2 STOP_WAS_INITIATED
## 1 1 19:04:00 2018 January Monday Based on C/W on Scene
## 2 2 23:00:00 2018 January Monday Based on Radio Run
## 3 3 23:55:00 2018 January Monday Based on Radio Run
## 4 4 3:23:00 2018 January Monday Based on Radio Run
## 5 5 3:23:00 2018 January Monday Based on Radio Run
## 6 6 21:20:00 2018 January Monday Based on Self Initiated
## ISSUING_OFFICER_RANK ISSUING_OFFICER_COMMAND_CODE SUPERVISING_OFFICER_RANK
## 1 POM 1 SGT
## 2 POM 34 SGT
## 3 POM 808 SGT
## 4 POM 63 SGT
## 5 POM 63 SGT
## 6 POM 71 SGT
## SUPERVISING_OFFICER_COMMAND_CODE OBSERVED_DURATION_MINUTES
## 1 1 0
## 2 34 1
## 3 808 0
## 4 63 2
## 5 63 2
## 6 71 1
## SUSPECTED_CRIME_DESCRIPTION STOP_DURATION_MINUTES
## 1 MENACING 18
## 2 CPW 15
## 3 GRAND LARCENY 10
## 4 ROBBERY 15
## 5 ROBBERY 15
## 6 UNAUTHORIZED USE OF A VEHICLE 15
## OFFICER_EXPLAINED_STOP_FLAG OFFICER_NOT_EXPLAINED_STOP_DESCRIPTION
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## SUSPECT_ARRESTED_FLAG SUSPECT_ARREST_OFFENSE OFFICER_IN_UNIFORM_FLAG
## 1 0 0 1
## 2 0 0 1
## 3 0 0 1
## 4 1 ROBBERY 1
## 5 1 ROBBERY 1
## 6 0 0 1
## FRISKED_FLAG SEARCHED_FLAG OTHER_CONTRABAND_FLAG FIREARM_FLAG
## 1 1 0 0 0
## 2 1 1 0 0
## 3 1 0 0 0
## 4 1 1 0 0
## 5 1 1 0 0
## 6 1 0 1 0
## KNIFE_CUTTER_FLAG OTHER_WEAPON_FLAG WEAPON_FOUND_FLAG PHYSICAL_FORCE_CEW_FLAG
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
## PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG
## 1 0 0
## 2 1 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## PHYSICAL_FORCE_OC_SPRAY_USED_FLAG PHYSICAL_FORCE_OTHER_FLAG
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## PHYSICAL_FORCE_RESTRAINT_USED_FLAG PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG
## 1 0 1
## 2 0 0
## 3 0 1
## 4 1 1
## 5 1 1
## 6 0 1
## BACKROUND_CIRCUMSTANCES_VIOLENT_CRIME_FLAG
## 1 0
## 2 0
## 3 1
## 4 1
## 5 1
## 6 0
## BACKROUND_CIRCUMSTANCES_SUSPECT_KNOWN_TO_CARRY_WEAPON_FLAG
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## SUSPECTS_ACTIONS_CASING_FLAG
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG
## 1 1
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## SUSPECTS_ACTIONS_DECRIPTION_FLAG SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 0
## 5 1 0
## 6 1 0
## SUSPECTS_ACTIONS_IDENTIFY_CRIME_PATTERN_FLAG SUSPECTS_ACTIONS_OTHER_FLAG
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 1
## 5 0 1
## 6 0 0
## SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG SEARCH_BASIS_ADMISSION_FLAG
## 1 1 0
## 2 0 0
## 3 1 0
## 4 0 0
## 5 0 0
## 6 1 0
## SEARCH_BASIS_CONSENT_FLAG SEARCH_BASIS_HARD_OBJECT_FLAG
## 1 0 0
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
## SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG SEARCH_BASIS_OTHER_FLAG
## 1 0 0
## 2 0 1
## 3 0 0
## 4 0 1
## 5 0 1
## 6 0 0
## SEARCH_BASIS_OUTLINE_FLAG DEMEANOR_OF_PERSON_STOPPED SUSPECT_REPORTED_AGE
## 1 0 COOPERATIVE 0
## 2 0 NERVOUS 26
## 3 0 RELIEVED 40
## 4 0 COMPLAIN 38
## 5 0 COMPLAINT 36
## 6 0 NERVOUS 20
## SUSPECT_SEX SUSPECT_RACE_DESCRIPTION SUSPECT_HEIGHT SUSPECT_WEIGHT
## 1 MALE WHITE 5.1 170
## 2 MALE BLACK HISPANIC 6.1 250
## 3 MALE WHITE HISPANIC 5.5 150
## 4 MALE AMERICAN INDIAN/ALASKAN NATIVE 5.1 160
## 5 MALE WHITE 5.11 230
## 6 MALE BLACK 5.1 150
## SUSPECT_BODY_BUILD_TYPE SUSPECT_EYE_COLOR SUSPECT_HAIR_COLOR
## 1 MED BRO BLK
## 2 HEA BRO BLK
## 3 THN BRO BLD
## 4 MED BRO BLK
## 5 MED BRO BLK
## 6 THN BRO BLK
## STOP_LOCATION_PRECINCT STOP_LOCATION_FULL_ADDRESS
## 1 1 VARICK STREET && FRANKLIN STREET
## 2 34 DYCKMAN STREET && POST AVENUE
## 3 43 2245 RANDALL AVENUE
## 4 63 EAST 38 STREET && AVENUE L
## 5 63 EAST 38 STREET && AVENUE L
## 6 67 178 ROCKAWAY PARKWAY
## STOP_LOCATION_BORO_NAME
## 1 MANHATTAN
## 2 MANHATTAN
## 3 BRONX
## 4 BROOKLYN
## 5 BROOKLYN
## 6 BROOKLYN
### Linear Regression 1: Observe Time VS. Stop Time
## First roughly observe the linear relationship between Observe Mins and Stop Mins
# scatter plot of observe time(x) vs. stop time(y), done above -- seems not very linear
ggplot(NYPD, aes(OBSERVED_DURATION_MINUTES, STOP_DURATION_MINUTES)) + geom_point() + labs(title="The Relation between Observed Time and Stop Time", subtitle="(with all data)", y="stop duration mins", x="observed duration mins")
# calculate the correlation -- seems a moderate/weak positive correlation
corr_observet_stopt <- cor(NYPD$OBSERVED_DURATION_MINUTES,NYPD$STOP_DURATION_MINUTES)
print(paste("correlation between observe mins and stop mins is",corr_observet_stopt))
## [1] "correlation between observe mins and stop mins is 0.505381065226995"
# generate a line of best fit
ggplot(NYPD, aes(OBSERVED_DURATION_MINUTES, STOP_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
## Build the linear regression model and check whether the residuals fit the assumptions
observet_stopt_model <- lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = NYPD)
print(observet_stopt_model)
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES,
## data = NYPD)
##
## Coefficients:
## (Intercept) OBSERVED_DURATION_MINUTES
## 11.372070 0.009779
# generate histogram, normal probability plot of residuals and a residual plot -- the normal probability plot is not straight at the right end, and the residuals are not very randomly scattered, so the linear regression model may not be appropriate in this case. Or it could be mainly due to extremely large outliers.
residuals<-resid(observet_stopt_model)
hist(residuals)
qqnorm(residuals)
plot(fitted(observet_stopt_model), residuals)
# calculate and plot the standard residuals:
standard_residuals <- rstandard(observet_stopt_model)
plot(fitted(observet_stopt_model), standard_residuals)
## Data Extraction: exclude outliers whose standard residuals are greater than 2:
observet_stopt <- NYPD[c("OBSERVED_DURATION_MINUTES", "STOP_DURATION_MINUTES")]
observet_stopt_stdres <- cbind(observet_stopt, standard_residuals)
dim(observet_stopt_stdres)
## [1] 11008 3
observet_stopt_analysis <- observet_stopt_stdres[observet_stopt_stdres$standard_residuals < 2,]
dim(observet_stopt_analysis)
## [1] 10763 3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is much higher than 0.05, so there is no statistical evidence that there is a linear relationship between observed time and stop time
summary(lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = observet_stopt_analysis))
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES,
## data = observet_stopt_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.791 -4.791 -1.791 3.209 34.209
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.7908003 0.0762773 128.36 <2e-16 ***
## OBSERVED_DURATION_MINUTES -0.0002305 0.0002993 -0.77 0.441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.908 on 10761 degrees of freedom
## Multiple R-squared: 5.513e-05, Adjusted R-squared: -3.779e-05
## F-statistic: 0.5933 on 1 and 10761 DF, p-value: 0.4412
### We would do the linear regression analysis again by excluding the outliers in Observe Time and Stop Time based on the methods we defined previously
### Outliers: Observe mins >35, Stop mins >50
exclude_observet_stopt <- NYPD[NYPD$OBSERVED_DURATION_MINUTES <=35 & NYPD$STOP_DURATION_MINUTES <=50,]
dim(exclude_observet_stopt)
## [1] 10795 59
dim(NYPD)
## [1] 11008 59
# Build the linear regression model and check whether the residuals fit the assumptions
exclude_observet_stopt_model <- lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = exclude_observet_stopt)
print(exclude_observet_stopt_model)
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES,
## data = exclude_observet_stopt)
##
## Coefficients:
## (Intercept) OBSERVED_DURATION_MINUTES
## 9.6704 0.2007
# generate histogram, normal probability plot of residuals and a residual plot
residuals<-resid(exclude_observet_stopt_model)
hist(residuals)
qqnorm(residuals)
plot(fitted(exclude_observet_stopt_model), residuals)
# calculate and plot the standard residuals:
standard_residuals <- rstandard(exclude_observet_stopt_model)
plot(fitted(exclude_observet_stopt_model), standard_residuals)
## Data Extraction: exclude outliers whose standard residuals are greater than 2:
exclude_observet_stopt <- exclude_observet_stopt[c("OBSERVED_DURATION_MINUTES", "STOP_DURATION_MINUTES")]
exclude_observet_stopt_stdres <- cbind(exclude_observet_stopt, standard_residuals)
dim(exclude_observet_stopt_stdres)
## [1] 10795 3
exclude_observet_stopt_analysis <- exclude_observet_stopt_stdres[exclude_observet_stopt_stdres$standard_residuals < 2,]
dim(exclude_observet_stopt_analysis)
## [1] 10171 3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is very small(much smaller than 0.05), so there is statistical evidence that stop mins is positively linear ro observed mins, after we excluding the outliers. The slope coefficient is about 0.25.
summary(lm(STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES, data = exclude_observet_stopt_analysis))
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ OBSERVED_DURATION_MINUTES,
## data = exclude_observet_stopt_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.630 -3.813 -1.843 1.944 18.439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.05569 0.06622 121.65 <2e-16 ***
## OBSERVED_DURATION_MINUTES 0.25249 0.01420 17.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.937 on 10169 degrees of freedom
## Multiple R-squared: 0.03015, Adjusted R-squared: 0.03005
## F-statistic: 316.1 on 1 and 10169 DF, p-value: < 2.2e-16
cor(exclude_observet_stopt_analysis$OBSERVED_DURATION_MINUTES,exclude_observet_stopt_analysis$STOP_DURATION_MINUTES)
## [1] 0.1736249
### Linear Regression 2: Stop Time VS. Suspect Reported Age
## First roughly observe the linear relationship between Stop Mins and Suspect Reported Age
# scatter plot of suspect reported age(x) vs. stop time(y), done above -- it is not linear by observation
ggplot(NYPD, aes(SUSPECT_REPORTED_AGE, STOP_DURATION_MINUTES)) + geom_point() + labs(title="Relation between Suspect Reported Age and Stop Time ", y="stop duration mins", x="suspect reported age")
# Here we use the same way as previously in part 1 to define and exclude outliers based on context; outliers: stop mins >50
sample_age_stopt <- NYPD[!NYPD$SUSPECT_REPORTED_AGE==0 & NYPD$STOP_DURATION_MINUTES<=50,]
# calculate the correlation -- seems a very weak positive correlation
corr_age_stopt <- cor(sample_age_stopt$SUSPECT_REPORTED_AGE, sample_age_stopt$STOP_DURATION_MINUTES)
print(paste("correlation between suspect reported age and stop mins is",corr_age_stopt))
## [1] "correlation between suspect reported age and stop mins is -0.00897348201582471"
# generate a line of best fit -- the line is approximately horizontal, meaning stop mins may not change a lot as age varies
ggplot(sample_age_stopt, aes(SUSPECT_REPORTED_AGE, STOP_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
## Build the linear regression model and check whether the residuals fit the assumptions
age_stopt_model <- lm(STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = sample_age_stopt)
print(age_stopt_model)
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = sample_age_stopt)
##
## Coefficients:
## (Intercept) SUSPECT_REPORTED_AGE
## 10.412864 -0.006322
# generate histogram, normal probability plot of residuals and a residual plot -- the histogram is not normally distributed and the normal probability plot is not approximately a straight line, so the linear regression model may not be appropriate in this case.
residuals<-resid(age_stopt_model)
hist(residuals)
qqnorm(residuals)
plot(fitted(age_stopt_model), residuals)
# Calculate and plot the standard residuals
standard_residuals <- rstandard(age_stopt_model)
plot(fitted(age_stopt_model), standard_residuals)
length(standard_residuals)
## [1] 10056
length(sample_age_stopt$STOP_DURATION_MINUTES)
## [1] 10056
## Data Extraction: exclude outliers whose standard residuals are greater than 2:
age_stopt <- sample_age_stopt[c("SUSPECT_REPORTED_AGE", "STOP_DURATION_MINUTES")]
age_stopt_stdres <- cbind(age_stopt, standard_residuals)
dim(age_stopt_stdres)
## [1] 10056 3
age_stopt_analysis <- age_stopt_stdres[age_stopt_stdres$standard_residuals < 2,]
dim(age_stopt_analysis)
## [1] 9399 3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is 0.03, which is less than 0.05, so we could reject the null hypothesis here in favor of alternative hypothesis. That is, there is statistical significance that suspect reported age is negatively linear to stop time.
summary(lm(STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = age_stopt_analysis))
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_REPORTED_AGE, data = age_stopt_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.706 -3.674 -1.621 1.706 18.727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.853653 0.155688 56.868 <2e-16 ***
## SUSPECT_REPORTED_AGE -0.010552 0.004893 -2.156 0.0311 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.833 on 9397 degrees of freedom
## Multiple R-squared: 0.0004946, Adjusted R-squared: 0.0003883
## F-statistic: 4.65 on 1 and 9397 DF, p-value: 0.03107
### Linear Regression 3: Observe Time VS. Suspect Height
## First roughly observe the linear relationship between Observe Mins and Suspect Height
# scatter plot of height(x) vs. observe time(y), done above -- it is not linear by observation
ggplot(NYPD, aes(SUSPECT_HEIGHT, OBSERVED_DURATION_MINUTES)) + geom_point() + labs(title="Relation between Suspect Height and Observed Time ", y="observed duration mins", x="suspect height")
# Here we use the same way as previously in part 1 to define and exclude outliers based on context; outliers: observed mins >35
NYPD$SUSPECT_HEIGHT <- as.numeric(NYPD$SUSPECT_HEIGHT)
## Warning: NAs introduced by coercion
sample_h_observet <- NYPD[!is.na(NYPD$SUSPECT_HEIGHT) & NYPD$OBSERVED_DURATION_MINUTES<=35,]
# calculate the correlation -- seems a very weak negative correlation
corr_h_observet <- cor(sample_h_observet$SUSPECT_HEIGHT, sample_h_observet$OBSERVED_DURATION_MINUTES)
print(paste("correlation between suspect height and observed mins is",corr_h_observet))
## [1] "correlation between suspect height and observed mins is -0.00223632383257302"
# generate a line of best fit -- the line is approximately horizontal, meaning observed mins may not change a lot as height varies
ggplot(sample_h_observet, aes(SUSPECT_HEIGHT, OBSERVED_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
## Build the linear regression model and check whether the residuals fit the assumptions
h_observet_model <- lm(OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = sample_h_observet)
print(h_observet_model)
##
## Call:
## lm(formula = OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = sample_h_observet)
##
## Coefficients:
## (Intercept) SUSPECT_HEIGHT
## 2.23066 -0.02331
# generate histogram, normal probability plot of residuals and a residual plot -- the histogram is not normally distributed and the normal probability plot is not approximately a straight line, so the linear regression model may not be appropriate in this case.
residuals<-resid(h_observet_model)
hist(residuals)
qqnorm(residuals)
plot(fitted(h_observet_model), residuals)
# Calculate and plot the standard residuals
standard_residuals <- rstandard(h_observet_model)
plot(fitted(h_observet_model), standard_residuals)
## Data Extraction: exclude outliers whose standard residuals are greater than 2:
h_observet <- sample_h_observet[c("SUSPECT_HEIGHT", "OBSERVED_DURATION_MINUTES")]
h_observet_stdres <- cbind(h_observet, standard_residuals)
dim(h_observet_stdres)
## [1] 10548 3
h_observet_analysis <- h_observet_stdres[h_observet_stdres$standard_residuals < 2,]
dim(h_observet_analysis)
## [1] 10295 3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is much higher than 0.05, so there is no statistical evidence that there is a linear relationship between suspect heights and observed time
summary(lm(OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = h_observet_analysis))
##
## Call:
## lm(formula = OBSERVED_DURATION_MINUTES ~ SUSPECT_HEIGHT, data = h_observet_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5806 -0.5625 -0.5522 0.4271 8.4736
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.40768 0.25068 5.615 2.01e-08 ***
## SUSPECT_HEIGHT 0.02581 0.04434 0.582 0.56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.733 on 10293 degrees of freedom
## Multiple R-squared: 3.293e-05, Adjusted R-squared: -6.423e-05
## F-statistic: 0.3389 on 1 and 10293 DF, p-value: 0.5605
### Linear Regression 4: Stop Time VS. Suspect Weight
## First roughly observe the linear relationship between Stop Mins and Suspect Weight
# scatter plot of suspect weight(x) vs. stop time(y), done above -- it is not linear by observation
ggplot(NYPD, aes(SUSPECT_WEIGHT, STOP_DURATION_MINUTES)) + geom_point() + labs(title="Relation between Suspect Weight and Stop Time ", y="stop duration mins", x="suspect weight")
# Here we use the same way as previously in part 1 to define and exclude outliers based on context; outliers: stop mins >50
NYPD$SUSPECT_WEIGHT <- as.numeric(NYPD$SUSPECT_WEIGHT)
## Warning: NAs introduced by coercion
sample_w_stopt <- NYPD[!is.na(NYPD$SUSPECT_WEIGHT) & NYPD$STOP_DURATION_MINUTES<=50,]
# calculate the correlation -- seems a very weak negative correlation
corr_w_stopt <- cor(sample_w_stopt$SUSPECT_WEIGHT, sample_w_stopt$STOP_DURATION_MINUTES)
print(paste("correlation between suspect weight and stop mins is",corr_w_stopt))
## [1] "correlation between suspect weight and stop mins is -0.0212103415853416"
# generate a line of best fit -- the line is downward light sloping
ggplot(sample_w_stopt, aes(SUSPECT_WEIGHT, STOP_DURATION_MINUTES)) + geom_point() + geom_smooth(method="lm")
## `geom_smooth()` using formula 'y ~ x'
## Build the linear regression model and check whether the residuals fit the assumptions
w_stopt_model <- lm(STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = sample_w_stopt)
print(w_stopt_model)
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = sample_w_stopt)
##
## Coefficients:
## (Intercept) SUSPECT_WEIGHT
## 10.946666 -0.005192
# generate histogram, normal probability plot of residuals and a residual plot -- the histogram is not normally distributed, the residuals are not very randomly scattered, and the normal probability plot is not approximately a straight line, so the linear regression model may not be appropriate in this case.
residuals<-resid(w_stopt_model)
hist(residuals)
qqnorm(residuals)
plot(fitted(w_stopt_model), residuals)
# Calculate and plot the standard residuals
standard_residuals <- rstandard(w_stopt_model)
plot(fitted(w_stopt_model), standard_residuals)
## Data Extraction: exclude outliers whose standard residuals are greater than 2:
w_stopt <- sample_w_stopt[c("SUSPECT_WEIGHT", "STOP_DURATION_MINUTES")]
w_stopt_stdres <- cbind(w_stopt, standard_residuals)
dim(w_stopt_stdres)
## [1] 10451 3
w_stopt_analysis <- w_stopt_stdres[w_stopt_stdres$standard_residuals < 2,]
dim(w_stopt_analysis)
## [1] 9806 3
## Summary of Linear Regression Analysis: the p-value for the slope coefficient is 0.04, which is less than 0.05, so we could reject the null hypothesis here in favor of alternative hypothesis. That is, there is statistical significance that suspect weight is negatively linear to stop time.
summary(lm(STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = w_stopt_analysis))
##
## Call:
## lm(formula = STOP_DURATION_MINUTES ~ SUSPECT_WEIGHT, data = w_stopt_analysis)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.671 -3.568 -1.568 1.706 18.603
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.047370 0.289084 31.297 <2e-16 ***
## SUSPECT_WEIGHT -0.003423 0.001674 -2.045 0.0409 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.793 on 9804 degrees of freedom
## Multiple R-squared: 0.0004263, Adjusted R-squared: 0.0003243
## F-statistic: 4.181 on 1 and 9804 DF, p-value: 0.0409
## Chi-square Test 1: Frisked Flag VS. Searched Flag
# Null hypothesis: There is no association between category A and category B
# Alternative hypothesis: There is an association between category A and category B
#install.packages("dplyr")
#library(dplyr)
#Select the variables we want to analyze: FRISKED_FLAG, SEARCHED_FLAG
analysis_vars <- NYPD %>%
select(FRISKED_FLAG, SEARCHED_FLAG) %>%
filter(FRISKED_FLAG %in% c("0", "1"), SEARCHED_FLAG %in% c("0", "1"))
head(analysis_vars)
## FRISKED_FLAG SEARCHED_FLAG
## 1 1 0
## 2 1 1
## 3 1 0
## 4 1 1
## 5 1 1
## 6 1 0
#Let's investigate to see if there is an association between FRISKED_FLAG and SEARCHED_FLAG
frisked_searched <- table(analysis_vars$FRISKED_FLAG, analysis_vars$SEARCHED_FLAG)
print(frisked_searched)
##
## 0 1
## 0 3444 1045
## 1 3896 2623
#Conduct a Chi-Squared Test of Independence
chisq.test(analysis_vars$FRISKED_FLAG, analysis_vars$SEARCHED_FLAG)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: analysis_vars$FRISKED_FLAG and analysis_vars$SEARCHED_FLAG
## X-squared = 343.28, df = 1, p-value < 2.2e-16
## As we can see, the p-value here is greatly larger than 0.05. Thus, we cannot refuse the Null hypothesis, there is no association between frisked flag and searched flag.
## Chi-square Test 2: Officer in Uniform Flag VS. Suspect Arrested Flag
#Select the variables we want to analyze: OFFICER_IN_UNIFORM_FLAG, SUSPECT_ARRESTED_FLAG
analysis_vars2 <- NYPD %>%
select(OFFICER_IN_UNIFORM_FLAG, SUSPECT_ARRESTED_FLAG) %>%
filter(OFFICER_IN_UNIFORM_FLAG %in% c("0", "1"), SUSPECT_ARRESTED_FLAG %in% c("0", "1"))
head(analysis_vars2)
## OFFICER_IN_UNIFORM_FLAG SUSPECT_ARRESTED_FLAG
## 1 1 0
## 2 1 0
## 3 1 0
## 4 1 1
## 5 1 1
## 6 1 0
#Let's investigate to see if there is an association between OFFICER_IN_UNIFORM_FLAG and SUSPECT_ARRESTED_FLAG
uniform_arrested <- table(analysis_vars2$OFFICER_IN_UNIFORM_FLAG, analysis_vars2$SUSPECT_ARRESTED_FLAG)
print(uniform_arrested)
##
## 0 1
## 0 1909 959
## 1 5984 2156
#Conduct a Chi-Squared Test of Independence
chisq.test(analysis_vars2$OFFICER_IN_UNIFORM_FLAG, analysis_vars2$SUSPECT_ARRESTED_FLAG)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: analysis_vars2$OFFICER_IN_UNIFORM_FLAG and analysis_vars2$SUSPECT_ARRESTED_FLAG
## X-squared = 50.166, df = 1, p-value = 1.413e-12
## As we can see, the p-value here is also larger than 0.05. Thus, we cannot refuse the Null hypothesis, there is no association between the officer in uniform flag and the suspect arrested flag.
### Comparison of Means T-test 1: mean observed duration time of 2 eye colors (a common eye color -- Brown and an uncommon eye color -- Blue)
## how we define common and uncommon: 9024 observations of "brown eye", 157 observations of "blue eye"
unique(NYPD$SUSPECT_EYE_COLOR)
## [1] "BRO" "BLK" "GRY" "HAZ" "ZZZ" "BLU" "0" "GRN" "MUL" "OTH"
table(NYPD$SUSPECT_EYE_COLOR)
##
## 0 BLK BLU BRO GRN GRY HAZ MUL OTH ZZZ
## 210 1025 157 9024 98 21 91 5 2 375
## subset the vectors that we are going to do the t-test
# We use the same method as previously in part 1 to exclude outliers in observed time(>35). We have to exclude that before the test, otherwise the mean would be severely raised by the outliers
sample_eye_observet <- NYPD %>%
select(OBSERVED_DURATION_MINUTES, SUSPECT_EYE_COLOR) %>%
filter(SUSPECT_EYE_COLOR %in% c("BRO", "BLU"), OBSERVED_DURATION_MINUTES<=35)
## Calculate the mean observed time by eye color
mean_observet_by_eyecolor <- sample_eye_observet %>%
group_by(SUSPECT_EYE_COLOR) %>%
summarize(mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(mean_observet_by_eyecolor)
## # A tibble: 2 × 2
## SUSPECT_EYE_COLOR mean_observet
## <chr> <dbl>
## 1 BLU 2.29
## 2 BRO 2.10
## Create a comparative box plot of Observed Duration Minutes by Eye Color
ggplot(sample_eye_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_EYE_COLOR)) + geom_boxplot(colour = "blue", fill="grey") + ggtitle("Observed Time by Eye Color")
## Conduct a comparison of means t-test:
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_EYE_COLOR, data=sample_eye_observet, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: OBSERVED_DURATION_MINUTES by SUSPECT_EYE_COLOR
## t = 0.54239, df = 160.81, p-value = 0.5883
## alternative hypothesis: true difference in means between group BLU and group BRO is not equal to 0
## 95 percent confidence interval:
## -0.5004594 0.8794577
## sample estimates:
## mean in group BLU mean in group BRO
## 2.292994 2.103494
# From the t-test result we can see that the p-value is much higher than 0.05 which implies there is no statistical significant difference in observed time means of the 2 different eye colors "blue" and "brown"; the 95% confidence intervals also include 0(no difference). We cannot reject null hypothesis.
### Comparison of Means T-test 2: mean stop duration time of 2 eye colors(a common eye color -- Brown and an uncommon eye color -- Blue)
## how we define common and uncommon: 9024 observations of "brown eye", 157 observations of "blue eye"
unique(NYPD$SUSPECT_EYE_COLOR)
## [1] "BRO" "BLK" "GRY" "HAZ" "ZZZ" "BLU" "0" "GRN" "MUL" "OTH"
table(NYPD$SUSPECT_EYE_COLOR)
##
## 0 BLK BLU BRO GRN GRY HAZ MUL OTH ZZZ
## 210 1025 157 9024 98 21 91 5 2 375
## subset the vectors that we are going to do the t-test
# We use the same method as previously in part 1 to exclude outliers in stop time(>50). We have to exclude that before the test, otherwise the mean would be severely raised by the outliers
sample_eye_stopt <- NYPD %>%
select(STOP_DURATION_MINUTES, SUSPECT_EYE_COLOR) %>%
filter(SUSPECT_EYE_COLOR %in% c("BRO", "BLU"), STOP_DURATION_MINUTES<=50)
## Calculate the mean stop time by eye color
mean_stopt_by_eyecolor <- sample_eye_stopt %>%
group_by(SUSPECT_EYE_COLOR) %>%
summarize(mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(mean_stopt_by_eyecolor)
## # A tibble: 2 × 2
## SUSPECT_EYE_COLOR mean_stopt
## <chr> <dbl>
## 1 BLU 10.9
## 2 BRO 10.1
## Create a comparative box plot of Stop Duration Minutes by Eye Color
ggplot(sample_eye_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_EYE_COLOR)) + geom_boxplot(colour = "blue", fill="grey") + ggtitle("Stop Time by Eye Color")
## Conduct a comparison of means t-test:
t.test(STOP_DURATION_MINUTES~SUSPECT_EYE_COLOR, data=sample_eye_stopt, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: STOP_DURATION_MINUTES by SUSPECT_EYE_COLOR
## t = 1.2761, df = 160.74, p-value = 0.2038
## alternative hypothesis: true difference in means between group BLU and group BRO is not equal to 0
## 95 percent confidence interval:
## -0.4737946 2.2043389
## sample estimates:
## mean in group BLU mean in group BRO
## 10.92308 10.05780
# From the t-test result we can see that p-value is higher than 0.05. This implies that there is no statistical significant difference in observed time means of the 2 different eye colors "blue" and "brown"; the 95% confidence intervals also include 0(no difference). We cannot reject null hypothesis.
### Comparison of Means T-test 3: mean observed duration time of thin suspects and heavy suspects (body build type)
unique(NYPD$SUSPECT_BODY_BUILD_TYPE)
## [1] "MED" "HEA" "THN" "XXX" "U" "0"
table(NYPD$SUSPECT_BODY_BUILD_TYPE) # Thin suspects are far more than heavy suspects.
##
## 0 HEA MED THN U XXX
## 148 1056 4107 5318 331 48
# We will use all the data to conduct the test first, and then we do again without outliers(observed mins >35)
## subset the vectors that we are going to do the t-test
all_sample_body_observet <- NYPD %>%
select(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"))
## Calculate the mean observed time by body build type
all_mean_observet_by_body <- all_sample_body_observet %>%
group_by(SUSPECT_BODY_BUILD_TYPE) %>%
summarize(all_mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_observet_by_body)
## # A tibble: 2 × 2
## SUSPECT_BODY_BUILD_TYPE all_mean_observet
## <chr> <dbl>
## 1 HEA 11.0
## 2 THN 33.0
## Create a comparative box plot of Observed Duration Minutes by Body Build Type
ggplot(all_sample_body_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Body Build Type")
## Conduct a comparison of means t-test:
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=all_sample_body_observet, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: OBSERVED_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = -1.1141, df = 5673.9, p-value = 0.2653
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
## -60.83788 16.74712
## sample estimates:
## mean in group HEA mean in group THN
## 10.99148 33.03686
# From the t-test result we can see that the p-value is much higher than 0.05 which implies there is no statistical significant difference in observed time means of thin suspects and heavy suspects(with all data including outliers). Even though we could see obvious difference in the computed means(Thin 33.04 mins, Heavy 10.99 mins), we cannot conclude the mean observed time is different in these 2 groups.
# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in observed time(>35), which is defined according to context.
sample_body_observet <- NYPD %>%
select(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"), OBSERVED_DURATION_MINUTES<=35)
mean_observet_by_body <- sample_body_observet %>%
group_by(SUSPECT_BODY_BUILD_TYPE) %>%
summarize(mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(mean_observet_by_body)
## # A tibble: 2 × 2
## SUSPECT_BODY_BUILD_TYPE mean_observet
## <chr> <dbl>
## 1 HEA 1.96
## 2 THN 1.96
ggplot(sample_body_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Body Build Type")
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=sample_body_observet, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: OBSERVED_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = 0.023353, df = 1448, p-value = 0.9814
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
## -0.2395527 0.2453253
## sample estimates:
## mean in group HEA mean in group THN
## 1.961796 1.958909
# The p-value is much higher this time(0.98 >>> 0.05), we are still rejecting the null hypothesis. From the 2 tests above, we could conclude there is no statistical significant difference in observed duration minutes of thin and heavy suspects.
### Comparison of Means T-test 4: mean stop duration time of thin suspects and heavy suspects (body build type)
unique(NYPD$SUSPECT_BODY_BUILD_TYPE)
## [1] "MED" "HEA" "THN" "XXX" "U" "0"
table(NYPD$SUSPECT_BODY_BUILD_TYPE) # Thin suspects are far more than heavy suspects.
##
## 0 HEA MED THN U XXX
## 148 1056 4107 5318 331 48
# We will use all the data to conduct the test first, and then we do again without outliers(stop mins >50)
## subset the vectors that we are going to do the t-test
all_sample_body_stopt <- NYPD %>%
select(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"))
## Calculate the mean stop time by body build type
all_mean_stopt_by_body <- all_sample_body_stopt %>%
group_by(SUSPECT_BODY_BUILD_TYPE) %>%
summarize(all_mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_stopt_by_body)
## # A tibble: 2 × 2
## SUSPECT_BODY_BUILD_TYPE all_mean_stopt
## <chr> <dbl>
## 1 HEA 10.9
## 2 THN 12.2
## Create a comparative box plot of Stop Duration Minutes by Body Build Type
ggplot(all_sample_body_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Body Build Type")
## Conduct a comparison of means t-test:
t.test(STOP_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=all_sample_body_stopt, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: STOP_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = -2.4166, df = 2772.8, p-value = 0.01573
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
## -2.2316900 -0.2323622
## sample estimates:
## mean in group HEA mean in group THN
## 10.92330 12.15532
# From the t-test result we see, p-value is smaller than 0.05 which implies there is statistical significant difference in stop time means of thin suspects and heavy suspects(with all data including outliers).
# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in stop time(>50), which is defined according to context.
sample_body_stopt <- NYPD %>%
select(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE) %>%
filter(SUSPECT_BODY_BUILD_TYPE %in% c("HEA", "THN"), STOP_DURATION_MINUTES<=50)
mean_stopt_by_body <- sample_body_stopt %>%
group_by(SUSPECT_BODY_BUILD_TYPE) %>%
summarize(mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(mean_stopt_by_body)
## # A tibble: 2 × 2
## SUSPECT_BODY_BUILD_TYPE mean_stopt
## <chr> <dbl>
## 1 HEA 9.98
## 2 THN 10.1
ggplot(sample_body_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_BODY_BUILD_TYPE)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Body Build Type")
t.test(STOP_DURATION_MINUTES~SUSPECT_BODY_BUILD_TYPE, data=sample_body_stopt, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: STOP_DURATION_MINUTES by SUSPECT_BODY_BUILD_TYPE
## t = -0.27834, df = 1478, p-value = 0.7808
## alternative hypothesis: true difference in means between group HEA and group THN is not equal to 0
## 95 percent confidence interval:
## -0.6441890 0.4840885
## sample estimates:
## mean in group HEA mean in group THN
## 9.980788 10.060838
# The p-value is much higher this time. Thus, we are rejecting the null hypothesis.
### Comparison of Means T-test 5: mean observed duration time of male suspects and female suspects (sex)
unique(NYPD$SUSPECT_SEX)
## [1] "MALE" "FEMALE" "0"
table(NYPD$SUSPECT_SEX) # Male suspects are far more than female suspects.
##
## 0 FEMALE MALE
## 69 1014 9925
# We will use all the data to conduct the test first, and then we do again without outliers(observed mins >35)
## subset the vectors that we are going to do the t-test
all_sample_sex_observet <- NYPD %>%
select(OBSERVED_DURATION_MINUTES, SUSPECT_SEX) %>%
filter(SUSPECT_SEX %in% c("MALE", "FEMALE"))
## Calculate the mean observed time by body build type
all_mean_observet_by_sex <- all_sample_sex_observet %>%
group_by(SUSPECT_SEX) %>%
summarize(all_mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_observet_by_sex)
## # A tibble: 2 × 2
## SUSPECT_SEX all_mean_observet
## <chr> <dbl>
## 1 FEMALE 6.79
## 2 MALE 23.3
## Create a comparative box plot of Observed Duration Minutes by Body Build Type
ggplot(all_sample_sex_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Sex")
## Conduct a comparison of means t-test:
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_SEX, data=all_sample_sex_observet, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: OBSERVED_DURATION_MINUTES by SUSPECT_SEX
## t = -1.5454, df = 10611, p-value = 0.1223
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
## -37.434521 4.428841
## sample estimates:
## mean in group FEMALE mean in group MALE
## 6.793886 23.296725
# From the t-test result we see that the p-value is still higher than 0.05, which implies there is no statistical significant difference in observed time means of male suspects and female suspects(with all data including outliers). Even though we could see obvious difference in the computed means(Female 6.79 mins, Male 23.30 mins), we cannot conclude the mean observed time is different in these 2 groups.
# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in observed time(>35), which is defined according to context.
sample_sex_observet <- NYPD %>%
select(OBSERVED_DURATION_MINUTES, SUSPECT_SEX) %>%
filter(SUSPECT_SEX %in% c("MALE", "FEMALE"), OBSERVED_DURATION_MINUTES<=35)
mean_observet_by_sex <- sample_sex_observet %>%
group_by(SUSPECT_SEX) %>%
summarize(mean_observet = mean(OBSERVED_DURATION_MINUTES, na.rm=TRUE))
print(mean_observet_by_sex)
## # A tibble: 2 × 2
## SUSPECT_SEX mean_observet
## <chr> <dbl>
## 1 FEMALE 3.02
## 2 MALE 2.01
ggplot(sample_sex_observet, aes(OBSERVED_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Observed Time by Sex")
t.test(OBSERVED_DURATION_MINUTES~SUSPECT_SEX, data=sample_sex_observet, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: OBSERVED_DURATION_MINUTES by SUSPECT_SEX
## t = 4.9617, df = 1080.2, p-value = 8.115e-07
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
## 0.6127442 1.4144164
## sample estimates:
## mean in group FEMALE mean in group MALE
## 3.020792 2.007212
# The p-value is much higher this time, we are still rejecting the null hypothesis. From the 2 tests above, we could conclude there is no statistical significant difference in observed duration minutes of thin and heavy suspects.
### Comparison of Means T-test 6: mean stop duration time of male suspects and female suspects (sex)
unique(NYPD$SUSPECT_SEX)
## [1] "MALE" "FEMALE" "0"
table(NYPD$SUSPECT_SEX) # Male suspects are far more than female suspects.
##
## 0 FEMALE MALE
## 69 1014 9925
# We will use all the data to conduct the test first, and then we do again without outliers(stop mins >50)
## subset the vectors that we are going to do the t-test
all_sample_sex_stopt <- NYPD %>%
select(STOP_DURATION_MINUTES, SUSPECT_SEX) %>%
filter(SUSPECT_SEX %in% c("MALE", "FEMALE"))
## Calculate the mean stop time by body build type
all_mean_stopt_by_sex <- all_sample_sex_stopt %>%
group_by(SUSPECT_SEX) %>%
summarize(all_mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(all_mean_stopt_by_sex)
## # A tibble: 2 × 2
## SUSPECT_SEX all_mean_stopt
## <chr> <dbl>
## 1 FEMALE 13.5
## 2 MALE 11.4
## Create a comparative box plot of Stop Duration Minutes by Body Build Type
ggplot(all_sample_sex_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Sex")
## Conduct a comparison of means t-test:
t.test(STOP_DURATION_MINUTES~SUSPECT_SEX, data=all_sample_sex_stopt, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: STOP_DURATION_MINUTES by SUSPECT_SEX
## t = 4.5599, df = 1516.6, p-value = 5.528e-06
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
## 1.193524 2.995497
## sample estimates:
## mean in group FEMALE mean in group MALE
## 13.47436 11.37985
# From the t-test result we see that the p-value is much higher than 0.05 which implies there is no statistical significant difference in stop time means of male suspects and female suspects(with all data including outliers).
# Then do it again excluding outliers: we use the same method as previously in part 1 to exclude outliers in stop time(>50), which is defined according to context.
sample_sex_stopt <- NYPD %>%
select(STOP_DURATION_MINUTES, SUSPECT_SEX) %>%
filter(SUSPECT_SEX %in% c("MALE", "FEMALE"), STOP_DURATION_MINUTES<=50)
mean_stopt_by_sex <- sample_sex_stopt %>%
group_by(SUSPECT_SEX) %>%
summarize(mean_stopt = mean(STOP_DURATION_MINUTES, na.rm=TRUE))
print(mean_stopt_by_sex)
## # A tibble: 2 × 2
## SUSPECT_SEX mean_stopt
## <chr> <dbl>
## 1 FEMALE 12.2
## 2 MALE 9.86
ggplot(sample_sex_stopt, aes(STOP_DURATION_MINUTES, SUSPECT_SEX)) + geom_boxplot(colour = "orange", fill="grey") + ggtitle("Stop Time by Sex")
t.test(STOP_DURATION_MINUTES~SUSPECT_SEX, data=sample_sex_stopt, equal.var=FALSE)
##
## Welch Two Sample t-test
##
## data: STOP_DURATION_MINUTES by SUSPECT_SEX
## t = 7.4806, df = 1150.1, p-value = 1.464e-13
## alternative hypothesis: true difference in means between group FEMALE and group MALE is not equal to 0
## 95 percent confidence interval:
## 1.746983 2.989192
## sample estimates:
## mean in group FEMALE mean in group MALE
## 12.230847 9.862759
# We are still rejecting the null hypothesis because the p-value is higher than 0.05. From the 2 tests above, we could conclude there is no statistical significant difference in stop duration minutes of male and female suspects.